
ARROW-6041: [Website] Blog post announcing R library availability on CRAN #4948

Closed

Conversation

@nealrichardson (Member) commented Jul 25, 2019

No description provided.


```r
library(arrow)
df <- read_parquet("path/to/file.parquet")
write_parquet(df, "path/to/different_file.parquet")
```

@nealrichardson (author) commented:

It would be cool to use a real/interesting example Parquet file here--anyone know of any? I found a few online, but they're all multi-file partitioned things, which we don't have good support for in R yet.

A member commented:

I always like to use the New York Taxi trip dataset for Parquet examples: a month of data has a decent size but loads very quickly. Sadly, there is no official source for a Parquet file of it.

## Feather files
@nealrichardson (author) commented:
@wesm you may especially want to review this section for historical accuracy and current policy stance.

@wesm (Member) commented Jul 25, 2019

Will review. We'll have to be careful about what we call a "release" on this blog, since that has a very specific meaning in Apache-land. When in doubt, say "Available on CRAN" rather than "Released on CRAN"


We are very excited to announce that the `arrow` R package is now available on CRAN.
A member commented:
Link to CRAN (for people who don't know what that is)?

```r
install.packages("arrow")
```

On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll first need to install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) for a list of PPAs from which you can obtain it. If you install the `arrow` package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call `install_arrow()` for version- and platform-specific guidance on installing the Arrow C++ library.

A member commented:

The "list of PPAs" is a bit too specific. Say "See ... to find pre-compiled binary packages for some common Linux distributions such as Debian, Ubuntu, CentOS, and Fedora. Other Linux distributions must install the libraries from source."

## Parquet files
A member commented:
Maybe say "Apache Parquet support" here


## Parquet files

This release introduces read and write support for the [Parquet](https://parquet.apache.org/) columnar data file format. Prior to this release, options for accessing Parquet data in R were limited; the most common recommendation was to use Spark. The `arrow` package greatly simplifies this access and lets you go from a Parquet file to a `data.frame` and back easily, without having to set up a database.
A member commented:
I think you need to qualify that this is "preliminary" read and write support that is in its early stages of development. Otherwise you're setting the wrong expectations. It would be accurate (and helpful) to state that the Python Arrow library has much richer support for Parquet files, including multi-file datasets, and we hope to achieve feature equivalency in the next 12 months.


## Feather files

This release also includes full support for the Feather file format, providing `read_feather()` and `write_feather()`. [Feather](https://github.com/wesm/feather) was one of the initial products coming out of the Arrow project, providing an efficient, common file format for language-agnostic data frame storage, along with implementations in R and Python.
@wesm commented Jul 26, 2019:
It's accurate to say "includes a much faster implementation of the Feather file format"

when you say "initial products coming out of the Arrow project" -- it didn't actually. Perhaps say "was one of the initial applications of Apache Arrow for Python and R".


With this release, the R implementation of Feather catches up and now depends on the same underlying C++ library as the Python version does. This should result in more reliable and consistent behavior across the two languages.

We encourage all R users of `feather` to switch to using `arrow::read_feather()` and `arrow::write_feather()`.
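As a sketch of the Feather round trip described above (assuming the `arrow` package and the underlying C++ library are installed; the data frame and temp-file path are illustrative):

```r
library(arrow)

# Build a small data frame and round-trip it through Feather.
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

tf <- tempfile(fileext = ".feather")
write_feather(df, tf)
df2 <- read_feather(tf)

# The round trip preserves the columns and values.
stopifnot(identical(df$x, df2$x), identical(df$y, df2$y))
```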
A member commented:
Maybe you want to say that we will look at adapting the "feather" package to be based on "arrow" (though this could upset some users).



Note that both Feather and Parquet are columnar data formats that allow sharing data frames across R, Pandas, and other tools. When should you use Feather and when should you use Parquet? We currently recommend Parquet for long-term storage, as well as for cases where the size on disk matters because Parquet supports various compression formats. Feather, on the other hand, may be faster to read in because it matches the in-memory format and doesn't require deserialization, and it also allows for memory mapping so that you can access data that is larger than can fit into memory. See the [Arrow project FAQ](https://arrow.apache.org/faq/) for more.
A member commented:
I think if you say "Parquet supports various compression formats" it might bring up some canards with the R community. It's simpler to say that "Parquet is optimized to create small files and as a result can be more expensive to read locally, but it performs very well with remote storage like HDFS or Amazon S3. Feather is designed for fast local reads, particularly with solid state drives, and is not intended for use with remote storage systems. Feather files can be memory-mapped and read in Arrow format without any deserialization while Parquet files always must be decompressed and decoded."
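To make the Feather-versus-Parquet comparison concrete, here is a minimal sketch (assuming the `arrow` package is installed; file paths and the example data are illustrative) that writes the same data frame in both formats:

```r
library(arrow)

# The same data frame can be written to either columnar format.
df <- data.frame(id = 1:1000, value = rnorm(1000))

pq <- tempfile(fileext = ".parquet")
ft <- tempfile(fileext = ".feather")
write_parquet(df, pq)  # Parquet encodes (and typically compresses) the data: smaller files, decoded on read
write_feather(df, ft)  # Feather matches the Arrow in-memory layout: fast local reads, can be memory-mapped

# Both round-trip back to a data frame with the same shape.
stopifnot(nrow(read_parquet(pq)) == nrow(read_feather(ft)))
```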


## Parquet files

This release introduces read and write support for the [Parquet](https://parquet.apache.org/) columnar data file format. Prior to this release, options for accessing Parquet data in R were limited; the most common recommendation was to use Spark. The `arrow` package greatly simplifies this access and lets you go from a Parquet file to a `data.frame` and back easily, without having to set up a database.
A member commented:
This is the first time you reference "Spark" in the article -- you need to use "Apache Spark"


## Other capabilities

In addition to these readers and writers, the `arrow` package has wrappers for other readers in the C++ library; see `?read_csv_arrow` and `?read_json_arrow`. It also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the [`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
A member commented:
To avoid peanuts being hurled from the gallery, you may want to state here that the functions like read_csv_arrow are being developed to optimize for the memory layout of the Arrow columnar format, and are not intended as a replacement for "native" functions that return R data.frame, for example.
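A brief sketch of the CSV reader mentioned above (assuming the `arrow` package is installed; the CSV file here is generated with base R purely for illustration):

```r
library(arrow)

# Write a small CSV with base R, then read it back with Arrow's C++ CSV reader.
tf <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = c(2.5, 3.5, 4.5)), tf, row.names = FALSE)

df <- read_csv_arrow(tf)
stopifnot(nrow(df) == 3, ncol(df) == 2)
```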

@@ -0,0 +1,87 @@
---
A member commented:
nit: change filename to remove "release"

@wesm wesm changed the title ARROW-6041: [Website] Blog post announcing R package release ARROW-6041: [Website] Blog post announcing R library availability on CRAN Aug 8, 2019
@wesm wesm closed this in d63fe6f Aug 8, 2019
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request Aug 11, 2019
…CRAN

Closes apache#4948 from nealrichardson/blog-cran-release and squashes the following commits:

7c8254b <Wes McKinney> Add note about nokogiri requirements
3b06bb4 <Wes McKinney> Update date, small language tweaks
fe98d6a <Neal Richardson> Add macOS R installation warning
b5d9e73 <Neal Richardson> Merge upstream/master
c5dd6fa <Neal Richardson> Incorporate Wes's revisions
ddb1857 <Neal Richardson> Add self to contributors.yml; remove thoughtcrime from post title
06c06e2 <Neal Richardson> First draft of R package release announcement

Lead-authored-by: Neal Richardson <[email protected]>
Co-authored-by: Wes McKinney <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>